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Abstract 

In the backbone of large-scale networks, origin-to-destination (OD) traffic flows experience abrupt 
unusual changes known as traffic volume anomalies, which can result in congestion and limit the extent 
to which end-user quality of service requirements are met. As a means of maintaining seamless end- 
user experience in dynamic environments, as well as for ensuring network security, this paper deals 
with a crucial network monitoring task termed dynamic anomalography. Given link traffic measurements 
(noisy superpositions of unobserved OD flows) periodically acquired by backbone routers, the goal is 
to construct an estimated map of anomalies in real time, and thus summarize the network 'health state' 
along both the flow and time dimensions. Leveraging the low intrinsic-dimensionality of OD flows and 
the sparse nature of anomalies, a novel online estimator is proposed based on an exponentially-weighted 
least-squares criterion regularized with the sparsity -promoting £i-norm of the anomalies, and the nuclear 
norm of the nominal traffic matrix. After recasting the non-separable nuclear norm into a form amenable 
to online optimization, a real-time algorithm for dynamic anomalography is developed and its convergence 
established under simplifying technical assumptions. For operational conditions where computational 
complexity reductions are at a premium, a lightweight stochastic gradient algorithm based on Nesterov's 
acceleration technique is developed as well. Comprehensive numerical tests with both synthetic and real 
network data corroborate the effectiveness of the proposed online algorithms and their tracking capabilities, 
and demonstrate that they outperform state-of-the-art approaches developed to diagnose traffic anomalies. 
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I. Introduction 

Communication networks have evolved from specialized, research and tactical transmission systems 
to large-scale and highly complex interconnections of intelligent devices. Thus, ensuring compliance to 
service-level agreements and quality-of-service (QoS) guarantees necessitates ground-breaking manage- 
ment and monitoring tools providing operators with a comprehensive and updated view of the network 
landscape. Situational awareness provided by such tools will be a key enabler for effective information 
dissemination, routing and congestion control, network health management, and security assurance. 

In the backbone of large-scale networks, origin-to-destination (OD) traffic flows experience abrupt 
unusual changes which can result in congestion, and limit QoS provisioning of the end users. These so- 
termed traffic volume anomalies could be due to unexpected failures in networking equipment, cyberattacks 
(e.g., denial of service (DoS) attacks), or, intruders which hijack the network services [35]. Unveiling 
such anomalies in a promptly manner is a crucial monitoring task towards engineering network traffic. 
This is a challenging task however, since the available data are usually high-dimensional, noisy and 
possibly incomplete link-load measurements, which are the superposition of unobservable OD flows. 
Several studies have experimentally demonstrated the low intrinsic dimensionality of the nominal traffic 
subspace, that is, the intuitive low-rank property of the traffic matrix in the absence of anomalies, which 
is mainly due to common temporal patterns across OD flows, and periodic behavior across time [21], 
[41]. Exploiting the low -rank structure of the anomaly-free traffic matrix, a landmark principal component 
analysis (PCA)-based method was put forth in [20] to identify network anomalies; see also [27] for a 
distributed implementation. A limitation of the algorithm in [20] is that it cannot identify the anomalous 
flows. Most importantly, [20] has not exploited the sparsity of anomalies across flows and time - anomalous 
traffic spikes are rare, and tend to last for short periods of time relative to the measurement horizon. 

Capitalizing on the low-rank property of the traffic matrix and the sparsity of the anomalies, the 
fresh look advocated here permeates benefits from rank minimization [8], [9], [11], and compressive 
sampling [10], [12], to perform dynamic anomalography. The aim is to construct a map of network 
anomalies in real time, that offers a succinct depiction of the network 'health state' across both the flow and 
time dimensions (Section II). Special focus will be placed on devising online (adaptive) algorithms that are 
capable of efficiently processing link measurements and track network anomalies 'on the fly'; see also [3] 
for a 'model-free' approach that relies on the kernel recursive LS (RLS) algorithm. Accordingly, the novel 
online estimator entails an exponentially-weighted least-squares (LS) cost regularized with the sparsity- 
promoting ^i-norm of the anomalies, and the nuclear norm of the nominal traffic matrix. After recasting 
the non-separable nuclear norm into a form amenable to online optimization (Section IITA), a real-time 
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algorithm for dynamic anomalography is developed in Section IV based on alternating minimization. 
Each time a new datum is acquired, anomaly estimates are formed via the least-absolute shrinkage and 
selection operator (Lasso), e.g, [17, p. 68], and the low-rank nominal traffic subspace is refined using 
RLS [34]. Convergence analysis is provided under simplifying technical assumptions in Section IV-B. 
For situations were reducing computational complexity is critical, an online stochastic gradient algorithm 
based on Nesterov's accelaration technique [5], [29] is developed as well (Section V-A). The possibility 
of implementing the anomaly trackers in a distributed fashion is further outlined in Section V-B, where 
several directions for future research are also delineated. 

Extensive numerical tests involving both synthetic and real network data corroborate the effectiveness 
of the proposed algorithms in unveiling network anomalies, as well as their tracking capabilities when 
traffic routes are slowly time-varying, and the network monitoring station acquires incomplete link traffic 
measurements (Section VI). Different from [40] which employs a two-step batch procedure to learn the 
nominal traffic subspace first, and then unveil anomalies via £i-norm minimization, the approach here 
estimates both quantities jointly and attains better performance as illustrated in Section VI-B. Concluding 
remarks are given in Section VII, while most technical details relevant to the convergence proof in Section 
IV-C are deferred to the Appendix. 

Notation: Bold uppercase (lowercase) letters will denote matrices (column vectors), and calligraphic letters 
will be used for sets. Operators (•)', tr(-), A m i n (-), [•]+, and E[-], will denote transposition, matrix trace, 
minimum eigenvalue, projection onto the nonnegative orthant, and expectation, respectively; | • | will be 
used for the cardinality of a set, and the magnitude of a scalar. The positive semidefinite matrix M 
will be denoted by M y 0. The £ p -noim of x £ 1" is ||x|| p := (Ya=i M p ) 1/p for V > L For two 
matrices M, U £ M nxn , (M, U) := tr(M'U) denotes their trace inner product. The Frobenious norm of 
matrix M = [mfj] G R nxp is \\M\\ F := y'tr(MM'), ||M|| := max|| x || 2=1 ||Mx|| 2 is the spectral norm, 
||M||i := Ylij \ m i,j\ i s tne ^i-norm, and ||M||* := Yli is the nuclear norm, where (Xj(M) denotes 

the i-th singular value of M. The n x n identity matrix will be represented by I n , while n will stand 
for the n X 1 vector of all zeros, and nX p := n 0' p . 

II. Modeling Preliminaries and Problem Statement 

Consider a backbone Internet protocol (IP) network naturally modeled as a directed graph G(J\f,C), 
where M and C denote the sets of nodes (routers) and physical links of cardinality \J\f\ = N and \C\ = L, 
respectively. The operational goal of the network is to transport a set of OD traffic flows F (with \F\ = F) 
associated with specific source-destination pairs. For backbone networks, the number of network layer 
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flows is much larger than the number of physical links (F S> L). Single -path routing is adopted here, that 
is, a given flow's traffic is carried through multiple links connecting the corresponding source-destination 
pair along a single path. Let r/ j, I G C, f G F, denote the flow to link assignments (routing), which take 
the value one whenever flow / is carried over link I, and zero otherwise. Unless otherwise stated, the 
routing matrix R := [rij] G {0, l} LxF is assumed fixed and given. Likewise, let Zfj denote the unknown 
traffic rate of flow / at time t, measured in e.g., Mbps. At any given time instant t, the traffic carried 
over link I is then the superposition of the flow rates routed through link /, i.e., J2feT r iJ z f,t- 

It is not uncommon for some of the flow rates to experience unusual abrupt changes. These so-termed 
traffic volume anomalies are typically due to unexpected network failures, or cyberattacks (e.g., DoS 
attacks) which aim at compromising the services offered by the network [35]. Let a/^ denote the unknown 
traffic volume anomaly of flow / at time t. In the presence of anomalous flows, the measured traffic carried 
by link I over a time horizon t € [1, T] is then given by 



where the noise variables v^ t account for measurement errors and unmodeled dynamics. 

In IP networks, traffic volume can be readily measured on a per-link basis using off-the-shelf tools such 
as the simple network management protocol (SNMP) supported by most routers. Missing entries in the 
link-level measurements y^ t may however skew the network operator's perspective. SNMP packets may 
be dropped for instance, if some links become congested, rendering link count information for those links 
more important, as well as less available [32]. To model missing link measurements, collect the tuples 
(I, t) associated with the available observations yi t in the set G [1, 2, L] x [1, 2, T]. Introducing the 
matrices Y := [y; )t ],V := [vi jt ] G M ixT , and Z := [zf >t ],A := [af >t ] G R FxT , the (possibly incomplete) 
set of measurements in (1) can be expressed in compact matrix form as 



where the sampling operator Vn(.) sets the entries of its matrix argument not in Q to zero, and keeps 
the rest unchanged. Matrix Z contains the nominal traffic flows over the time horizon of interest. Com- 
mon temporal patterns among the traffic flows in addition to their periodic behavior, render most rows 
(respectively columns) of Z linearly dependent, and thus Z typically has low rank. This intuitive property 
has been extensively validated with real network data; see e.g, [21]. Anomalies in A are expected to 
occur sporadically over time, and last shortly relative to the (possibly long) measurement interval [1,T]. 
In addition, only a small fraction of the flows is supposed to be anomalous at a any given time instant. 
This renders the anomaly traffic matrix A sparse across both rows (flows) and columns (time). 




(1) 



V n (Y) =7> n (R(Z + A)+V) 



(2) 
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Given measurements Vn(Y) adhering to (2) and the binary-valued routing matrix R, the main goal of 
this paper is to accurately estimate the anomaly matrix A, by capitalizing on the sparsity of A and the 
low-rank property of Z. Special focus will be placed on devising online (adaptive) algorithms that are 
capable of efficiently processing link measurements and tracking network anomalies in real time. This 
critical monitoring task is termed dynamic anomalography, and the resultant estimated map A offers a 
depiction of the network's 'health state' along both the flow and time dimensions. If |a/^| > 0, the f- 
th flow at time t is deemed anomalous, otherwise it is healthy. By examining R the network operator 
can immediately determine the links carrying the anomalous flows. Subsequently, planned contingency 
measures involving traffic-engineering algorithms can be implemented to address network congestion. 

III. Unveiling Anomalies via Sparsity and Low Rank 

Consider the nominal link-count traffic matrix X := RZ, which inherits the low-rank property from Z. 
Since the primary goal is to recover A, the following observation model 

Pn(Y)=7>n(X + RA + V) (3) 

can be adopted instead of (2). A natural estimator leveraging the low rank property of X and the sparsity 
of A will be sought next. The idea is to fit the incomplete data Vn(Y) to the model X + RA in 
the least-squares (LS) error sense, as well as minimize the rank of X, and the number of nonzero 
entries of A measured by its £o _ (P seu do) norm. Unfortunately, albeit natural both rank and 4r norm 
criteria are in general NP-hard to optimize [16], [28]. Typically, the nuclear norm ||X||* and the ^i-norm 
||A||i are adopted as surrogates, since they are the closest convex approximants to rank(X) and ||A||o, 
respectively [10], [30], [36]. Accordingly, one solves 

(PI) min i||P n (Y-X-RA)|||. + A,||X||, + Ai||A||i (4) 

{X,A} 2 

where A*, Ai > are rank- and sparsity-controlling parameters. When an estimate d~1 of the noise variance 
is available, guidelines for selecting A* and Ai have been proposed in [42]. Being convex (PI) is appealing, 
and it is known to attain good performance in theory and practice [25]. Also (3) and its estimator (PI) 
are quite general, as discussed in the ensuing remark. 

Remark 1 (Subsumed paradigms): When there is no missing data and X = Olxt, one is left with 
an under-determined sparse signal recovery problem typically encountered with compressive sampling 
(CS); see e.g., [10] and the tutorial account [12]. The decomposition Y = X + A corresponds to principal 
component pursuit (PCP), also referred to as robust principal component analysis (PCA) [8], [13]. PCP was 
adopted for network anomaly detection using flow (not link traffic) measurements in [2]. For the idealized 
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noise-free setting (V = Olxt), sufficient conditions for exact recovery are available for both of the 
aforementioned special cases [8], [10], [13]. However, the superposition of a low-rank plus a compressed 
sparse matrix in (3) further challenges identinability of {X,A}; see [25] for early results. Going back 
to the CS paradigm, even when X is nonzero one could envision a variant where the measurements are 
corrupted with correlated (low-rank) noise [14]. Last but not least, when A = OfxT and Y is noisy, 
the recovery of X subject to a rank constraint is nothing but PCA - arguably, the workhorse of high- 
dimensional data analytics. This same formulation is adopted for low-rank matrix completion, to impute 
the missing entries of a low-rank matrix observed in noise, i.e., Vq(Y) = Vn(X. + V) [11]. 

Albeit convex, (PI) is a non-smooth optimization problem (both the nuclear and t\ -norms are not 
differentiable at the origin). In addition, scalable algorithms to unveil anomalies in large-scale networks 
should effectively overcome the following challenges: (cl) the problem size can easily become quite large, 
since the number of optimization variables is (L + F)T; (c2) existing iterative solvers for (PI) typically 
rely on costly SVD computations per iteration; see e.g., [25]; and (c3) different from the Frobenius and t\- 
norms, (columnwise) nonseparability of the nuclear-norm challenges online processing when new columns 
of Vq(Y) arrive sequentially in time. In the remainder of this section, the 'big data' challenges (cl) and 
(c2) are dealt with to arrive at an efficient batch algorithm for anomalography. Tracking network anomalies 
is the main subject of Section IV. 

To address (cl) and reduce the computational complexity and memory storage requirements of the 
algorithms sought, it is henceforth assumed that an upper bound p > rank(X) is a priori available [X is 
the estimate obtained via (PI)]. As argued next, the smaller the value of p, the more efficient the algorithm 
becomes. Small values of p are well motivated due to the low intrinsic dimensionality of network flows [20]. 
Because rank(X) < p, (Pl)'s search space is effectively reduced and one can factorize the decision variable 
as X = PQ', where P and Q are L x p and T x p matrices, respectively. It is possible to interpret the 
columns of X (viewed as points in M L ) as belonging to a low-rank 'nominal traffic subspace', spanned by 
the columns of P. The rows of Q are thus the projections of the columns of X onto the traffic subspace. 

Adopting this reparametrization of X in (PI), one arrives at an equivalent optimization problem 

(P2) min i||P n (Y-PQ'-RA)|^ + A,||PQ / ||, + Ai||A|| 1 

{P,Q,A}2 

which is non-convex due to the bilinear terms PQ'. The number of variables is reduced from (L + F)T 
in (PI), to p(L + T) + FT in (P2). The savings can be significant when p is small, and both L and T 
are large. Note that the dominant FT-term in the variable count of (P2) is due to A, which is sparse and 
can be efficiently handled even when both F and T are large. 
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A. A separable low-rank regularization 

To address (c2) [along with (c3) as it will become clear in Section IV], consider the following alternative 
characterization of the nuclear norm [30], [31] 

||X||*:=min \ { ||P||| + ||Q|||} , s. t. X = PQ'. (5) 

The optimization (5) is over all possible bilinear factorizations of X, so that the number of columns p 
of P and Q is also a variable. Leveraging (5), the following reformulation of (P2) provides an important 
first step towards obtaining an online algorithm: 

(P3) min i||^(Y-PQ / -RA)||2, + ^{||P||| + ||Q|||}+A 1 ||A|| 1 . 

As asserted in [24, Lemma 1], adopting the separable Frobenius-norm regularization in (P3) comes with 
no loss of optimality relative to (PI), provided p > rank(X). By finding the global minimum of (P3) 
[which could have considerably less variables than (PI)], one can recover the optimal solution of (PI). 
However, since (P3) is non-convex, it may have stationary points which need not be globally optimum. 
Interestingly, the next proposition shows that under relatively mild assumptions on rank(X) and the noise 
variance, every stationary point of (P3) is globally optimum for (PI). For a proof, see [24, App. A]. 

Proposition 1: Let {P,Q, A} be a stationary point of (P3). If \\Vn(Y - PQ' - RA)|| < A*, then 
{X := PQ', A = A} is the globally optimal solution of (PI). 

The qualification condition ^^(Y — PQ' — RA)|| < A* captures tacitly the role of p. In particular, for 
sufficiently small p the residual ||"Pn(Y — PQ' — RA)|| becomes large and consequently the condition 
is violated [unless A* is large enough, in which case a sufficiently low-rank solution to (PI) is expected]. 
The condition on the residual also implicitly enforces rank(X) < p, which is necessary for the equivalence 
between (PI) and (P3). In addition, the noise variance affects the value of ||"Pn(Y — PQ' — RA)||, and 
thus satisfaction of the said qualification inequality. 

B. Batch block coordinate-descent algorithm 

The block coordinate-descent (BCD) algorithm is adopted here to solve the batch non-convex optimiza- 
tion problem (P3). BCD is an iterative method which has been shown efficient in tackling large-scale 
optimization problems encountered with various statistical inference tasks, see e.g., [6]. The proposed 
solver entails an iterative procedure comprising three steps per iteration k = 1, 2, . . . 

[SI] Update the anomaly map: 



A\k + 11 = arg min 

A 



^\\Vn(Y-P[k}Q'[k}-KA)\\ 2 F + X l \\A\ 
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[S2] Update the nominal traffic subspace: 



P \k + 11 = are min 
p 



||P n (Y-PQ , [fc]-RA[A; + l])|||,+ 



A* 



[S3] Update the projection coefficients: 



Q \k + 11 = are min 
Q 



l\\V n (Y - P[fc + 1]Q' - RA[k + 1])||| + ^HQIII. 



To update each of the variable groups, the cost of (P3) is minimized while fixing the rest of the variables to 
their most up-to-date values. The minimization in [SI] decomposes over the columns of A := [ai, . . . , ay]. 
At iteration k, these columns are updated in parallel via Lasso 



&t[k + 1] = argmin 



1, 



n t (y t - P[A;]q t M - Ra)||i + Ai||a||i 



t = l. 



(6) 



where y t and qt[k] respectively denote the £-th column of Y and Q'[A;], while the diagonal matrix 
Qt 6 R LxL contains a one on its l-th diagonal entry if yi t is observed, and a zero otherwise. In practice, 
each iteration of the proposed algorithm minimizes (6) inexactly, by performing one pass of the cyclic 
coordinate-descent algorithm in [17, p. 92]; see Algorithm 1 for the detailed iterations. As shown at the 
end of this section, this inexact minimization suffices to claim convergence to a stationary point of (P3). 

Similarly, in [S2] and [S3] the minimizations that give rise to P[A; + 1] and Q[fc + 1] are separable over 
their respective rows. For instance, the l-th row of the traffic subspace matrix P := [pi, . . . ,pz]' is 
updated as the solution of the following ridge-regression problem 



P; [k + 1] = arg min 



^l((yF)' - p'Q'lk] - (rf)'A[* + i])finii + y ||p||jj 



(7) 



where (y[)' and (r[)' represent the Z-th row of Y and R, respectively. The t-th diagonal entry of the 
diagonal matrix fi[ G M TxT is an indicator variable testing whether measurement y^ t is available. A 
similar regularized LS problem yields qt[k + 1], t = 1, . . . , T; see Algorithm 1 for the details and a 
description of the overall BCD solver. The novel batch scheme for unveiling network anomalies is less 
complex computationally than the accelerated proximal gradient algorithm in [25], since Algorithm l's 
iterations are devoid of SVD computations. 

Despite being non-convex and non-differentiable, (P3) has favorable structure which facilitates conver- 
gence of the iterates generated by Algorithm 1. Specifically, the resulting cost is convex in each block 
variable when the rest are fixed. The non-smooth £i-norm is also separable over the entries of its matrix 
argument. Accordingly, [37, Theorem 5.1] guarantees convergence of the BCD algorithm to a stationary 
point of (P3). This result together with Proposition 1 establishes the next claim. 
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Algorithm 1 : Batch BCD algorithm for unveiling network anomalies 
input P n (Y),fi,R,A*, and Ai. 

initialize P[l] and Q[l] at random. 

for k = 1,2,. . . do 

[SI] Update the anomaly map: 

for / = 1,...,F do 

y[- f) [k + 1] = n t (y t - P[k]qt[k) - E/^i «W,t[* + 1] - E/'=/+i '/'<>/'.*[*]). * = 1, • • - 
a /lt [* + 1] = sign(^y(-^ [fc + 1]) [|r>y^ _/) [k + 1]| - A x ] + /||r / || 2 , t = 1, . . . , T. 
end for 

A[fc + 1] = [[oi,i[A + 1], • • . , a F ,i[fc + 1]]', . . . , [oi,r[A + !],•••, arf + 1]]'] . 
[S2] Update the nominal traffic subspace: 

pi[k + 1] = (A.I, + Q'^nfQtfe])- 1 Q'[fc]nf (y[ - A'[fc + l]r[), I = 1,...,L. 
P[fc + 1] = [p 1 [* + l] I ...,p i [* + l]] / . 
[S3] Update the projection coefficients: 

q*[* + l] = (A.I / , + P / [*+l]n t P[A + l]r 1 P'[A+l]n t (y t -Ra t [* + l]), t=l,...,T. 
Q[k + 1] = [qi[* + l],...,q T [A+l]]'. 
end for 

return A := A[oo] and X := P[oo]Q'[oo]. 



Proposition 2: If a subsequence {X[/c] := P[fc]Q'[fe], A[fc]} of iterates generated by Algorithm 1 satisfies 
\\Vn(Y — X[fc] — RA[fc])|| < A*, then it converges to the optimal solution set of (PI) as k — > oo. 

In practice, it is desirable to monitor anomalies in real time and accomodate time-varying traffic routes. 
These reasons motivate devising algorithms for dynamic anomalography, the subject dealt with next. 

IV. Dynamic Anomalography 

Monitoring of large-scale IP networks necessitates collecting massive amounts of data which far out- 
weigh the ability of modern computers to store and analyze them in real time. In addition, nonstationarities 
due to routing changes and missing data further challenge identification of anomalies. In dynamic networks 
routing tables are constantly readjusted to effect traffic load balancing and avoid congestion caused by 
e.g., traffic anomalies or network infrastructure failures. To account for slowly time-varing routing tables, 
let R t G M Lxi? denote the routing matrix at time t l . In this dynamic setting, the partially observed link 

'Fixed size routing matrices Rt are considered here for convenience, where L and F correspond to upper bounds on the 
number of physical links and flows transported by the network, respectively. If at time t some links are not used at all, or, less 
than F flows are present, the corresponding rows and columns of R t will be identically zero. 
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counts at time t adhere to [cf. (3)] 

^n t (yt)=7 , n t (xt + Rta«+v t ), 4 = 1,2,... (8) 

where the link-level traffic x t := Rtz t , for z t from the (low-dimensional) traffic subspace. In general, 
routing changes may alter a link load considerably by e.g., routing traffic completely away from a specific 
link. Therefore, even though the network-level traffic vectors {zj} live in a low-dimensional subspace, 
the same may not be true for the link-level traffic {x t } when the routing updates are major and frequent. 
In backbone networks however, routing changes are sporadic relative to the time-scale of data acquisition 
used for network monitoring tasks. For instance, data collected from the operation of Internet-2 network 
reveals that only a few rows of Ht change per week [1]. It is thus safe to assume that {xt} still lies 
in a low-dimensional subspace, and exploit the temporal correlations of the observations to identify the 
anomalies. 

On top of the previous arguments, in practice link measurements are acquired sequentially in time, 
which motivates updating previously obtained estimates rather than re-computing new ones from scratch 
each time a new datum becomes available. The goal is then to recursively estimate {if, a 4 } at time t from 
historical observations {Vn T (y T ), Q T Y T=1 , naturally placing more importance to recent measurements. To 
this end, one possible adaptive counterpart to (P3) is the exponentially-weighted LS estimator found by 
minimizing the empirical cost 



t 

min V/3*" T 
{P,Q,A}^ 



2 2 l^u=lP 2 



(9) 



|2 i n * 1(131(2 i 1 1 2 

T=l L 

in which < /3 < 1 is the so-termed forgetting factor. When j3 < 1 data in the distant past are exponentially 
downweighted, which facilitates tracking network anomalies in nonstationary environments. In the case 
of static routing (R t = R, t = 1, 2, . . .) and infinite memory (fi = 1), the formulation (9) coincides with 
the batch estimator (P3). This is the reason for the time-varying factor weighting ||P||^. 

A. Tracking network anomalies 

Towards deriving a real-time, computationally efficient, and recursive solver of (9), an alternating 
minimization method is adopted in which iteration k coincides with the time scale t of data acquisition. A 
justification in terms of minimizing a suitable approximate cost function is discussed in detail in Section 
IV-B. Per time instant t, a new datum {Vn t {yt), ^t} is drawn and {q t , a t } are jointly estimated via 



{q[i],a[i]} = arg min 
{q 5 a} 



^\\Vn t (yt ~ P[* " l]q " R*a)||| + ^||q||| + AilM^ 



(10) 
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It turns out that (10) can be efficiently solved. Fixing a to carry out the minimization with respect to q 
first, one is left with an £2 -norm regularized LS (ridge-regression) problem 

2 



q[t] = arg min 



^||Pn t (y 4 -P[t-l]q-R t a)||^ + ^||q|| 



(A*I P + P'[t - l]Sl t P[t - 1]) 1 P'[t - l]Vu t (yt - R*a). 



(11) 



Note that q[t] is an affme function of a, and the update rule for q[t] is not well defined until a is replaced 
with a[t]. Towards obtaining an expression for a[t], define D[t] := (AJ P + P[t - l]n t P'[< - 1]) P'[t- 
1] for notational convenience, and substitute (11) back into (10) to arrive at the Lasso estimator 



a[t] = arg min 



-||F[t](y t - Rta)||l + Ai||a||i 



(12) 



where F[t] := [Cl t - fl t P[t - l]B[t]fl t , y/Kflt'D'it}] . The diagonal matrix Sl t was defined in Section 
IH-B, see the discussion after (6). 

In the second step of the alternating-minimization scheme, the updated subspace matrix P [t] is obtained 
by minimizing (9) with respect to P, while the optimization variables {q T ,a r }^. =1 are fixed and take the 
values {q[r], a[r]}^. =1 . This yields 



P[t] 



arg mm 



A* 

T 



* 1 

|P||| + £ P^-WPiKfrr ~ Pq[r] - R T a[r] 



T=l 



Similar to the batch case, (13) decouples over the rows of P which are obtained in parallel via 

t 



pi [t] = arg min 



A 



fllPlI 2 +J2^lAm,r - P'qM - (r[, T )'a[r]) 5 



r=l 



1 = 1,. ..,L 



(13) 



(14) 



where coi >T denotes the Z-th diagonal entry of Q T . For (3 = 1, subproblems (14) can be efficiently 
solved using the RLS algorithm [34]. Upon defining si[t] := Y? T =l fi t ~ T ^i^Vi.t — r i T a i T ]) c l[' r ]> Hj[t] := 
Y? T =i /^kVqMq'fr] + A*I P , and Mi[t] := Hj -1 ^], with /3 = 1 one simply updates 

Bi[t] = Bi[t - 1] + ui tt (yi,t - < t a[t])q[t] 

Mj t - M;[t - 1 - t — — 

l + q'[tjM z [t- ljq[tj 

and forms pi[t] = M.i[t]si[t), for / = 1, . . . , L. 

However, for < /3 < 1 the regularization term (A*/2)||p|| 2 in (14) makes it impossible to express 
Hj[t] in terms of H/[t — 1] plus a rank-one correction. Hence, one cannot resort to the matrix inversion 
lemma and update Mj[i] with quadratic complexity only. Based on direct inversion of H/[i], I = 1, . . . , L, 
the overall recursive algorithm for tracking network anomalies is tabulated under Algorithm 2. The per 
iteration cost of the L inversions (each 0(p 3 ), which could be further reduced if one leverages also the 
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Algorithm 2 : Online algorithm for tracking network anomalies 
input {Pn t (y t ),a,Rt} t =i,/3,A,, and Ai. 

initialize G;[0] = pxp , s;[0] = P , I = 1, and P[0] at random, 
for £ = 1,2,... do 

D[t] = {Kip + P'[t - l]OtP[* - I])" 1 P'[* - 1]. 
F[t] = [fh - fhP[t - l]D[t]nt, vCTtD'M] '. 
a[i] = argmin a [|||F[f](y t - R t a)|| 2 + Ai||a||i]. 
q[i] = D[i]n t (y t -R t a[t]). 
G,[t] =/3G i [i-l]+w i , i q[t]q[t] , J Z = 1,...,L. 
sj[t] = 0s, [t - 1] + Wi,t(»i, t - rj |t a[t])q[t], I = 1, . . . , L. 
pi[t] = (Gi[t]+A*I p ) -1 Bj[t], I = 1,...,L. 
return a t := a[i] and ic t := P[<]q[t]. 
end for 



symmetry of Hj[t]) is affordable for moderate number of links, because p is small when estimating low- 
rank traffic matrices. Still, for those settings where computational complexity reductions are at a premium, 
an online stochastic gradient descent algorithm is described in Section V-A. 

Remark 2 (Robust subspace trackers): Algorithm 2 is closely related to timely robust subspace track- 
ers, which aim at estimating a low-rank subspace P from grossly corrupted and possibly incomplete data, 
namely Vn t (yt) = 'Pn t (Pq.t + a t + Vt), t = 1,2,.... In the absence of sparse 'outliers' {a*}^, an 
online algorithm based on incremental gradient descent on the Grassmannian manifold of subspaces was 
put forth in [4]. The second-order RLS-type algorithm in [15] extends the seminal projection approximation 
subspace tracking algorithm [39] to handle missing data. When outliers are present, robust counterparts 
can be found in [14], [18], [26]. Relative to all aforementioned works, the estimation problem here is more 
challenging due to the presence of the fat (compression) matrix H t ; see [25] for fundamental identifiability 
issues related to the model (3). 

B. Convergence Analysis 

This section studies the convergence of the iterates generated by Algorithm 2, for the infinite memory 
special case i.e., when (3 = 1. Upon defining the function 

St(P,q,a) := \\\Vn t {y t - Pq- R t a)||i + ^||q||| + Ai||a||i 
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in addition to £t(P) '■= min{ q a } <^(P, q, a), the online solver of Section IV-A aims at minimizing the 
following average cost function at time t 

Q(P):=ij>(P) + ^||P||i. (15) 

T=l 

Normalization (by t) ensures that the cost function does not grow unbounded as time evolves. For any 
finite t, (15) it is essentially identical to the batch estimator in (P3) up to a scaling, which does not affect 
the value of the minimizers. Note that as time evolves, minimization of Ct becomes increasingly complex 
computationally. Even evaluating Ct is challenging for large t, since it entails solving t Lasso problems 
to minimize all g T and define the functions i T , r = 1, . . . , T. Hence, at time t the subspace estimate P[t] 
is obtained by minimizing the approximate cost function 

Ci(P) = j J>(P,q[r],a[T]) + ^||P||| (16) 

r=l 

in which {q[i],a[t]} are obtained based on the prior subspace estimate P[t — 1] after solving [cf. (10)] 

{q[t],a[t]} =argmin 5 i(P[t-l],q,a). (17) 
{q.a} 

Obtaining q[i] this way resembles the projection approximation adopted in [39], and can only be eval- 
uated after &[t] is obtained [cf. (11)]. Since Ct(P) is a smooth convex function, the minimizer P[t] = 
argminp Ct(P) is the solution of the quadratic equation VCt(P[i]) = 0l X/3 . 

So far, it is apparent that the approximate cost function C%(P[i]) overestimates the target cost Ct(P[i]), 
for t = 1,2,.... However, it is not clear whether the dictionary iterates {Pfi]}^ converge, and most 
importantly, how well can they optimize the target cost function Cf. The good news is that d%(P[i]) 
asymptotically approaches Ct(P[i]), and the subspace iterates null VCt(P[t]) as well, both as t — > oo. 
The latter result is summarized in the next proposition, which is proved in the next section. 

Proposition 3: Assume that: al) {£lt}uLi and {y*}^i are independent and identically distributed (i.i.d.) 
random processes; a2) \\Vn t (yt)\\oo is uniformly bounded; a3) iterates {P[t]}^ 1 are in a compact set 
C C M Lxp ; a4) Ct(P) is positive definite, namely A m i n V 2 C%(P) > c for some c > 0; and a5) the 
Lasso (12) has a unique solution. Then lim^oo VCi(P[t]) = Oixp almost surely (a.s.), i.e., the subspace 
iterates {P[t]}^ 1 asymptotically coincide with the stationary points of the batch problem (P3). 

To clearly delineate the scope of the analysis, it is worth commenting on the assumptions al)-a5) 
and the factors that influence their satisfaction. Regarding al), the acquired data is assumed statistically 
independent across time as it is customary when studying the stability and performance of online (adaptive) 
algorithms [34]. Still, in accordance with the adaptive filtering folklore, as /3 — > 1 the upshot of the analysis 
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based on i.i.d. data extends accurately to the pragmatic setting whereby the link-counts and missing data 
patterns exhibit spatiotemporal correlations. Uniform boundedness of Vn t (yt) [cf. a2)] is satisfied in 
practice, since the traffic is always limited by the (finite) capacity of the physical links. The bounded 
subspace requirement in a3) is a technical assumption that simplifies the arguments of the ensuing proof, 
and has been corroborated via computer simulations. It is apparent that the sampling set U t plays a key 
role towards ensuring that a4) and a5) are satisfied. Intuitively, if the missing entries tend to be only few 
and somehow uniformly distributed across links and time, they will not markedly increase coherence of 
the regression matrices F[i]R t , and thus compromise the uniqueness of the Lasso solutions. This also 
increases the likelihood that V 2 Cf(P) = ^I Lp + \ £t=i(q[r]q'[r]) ® n r h & Lp holds. As argued 
in [23], if needed one could incorporate additional regularization terms in the cost function to enforce a4) 
and a5). Before moving on to the proof, a remark is in order. 

Remark 3 (Performance guarantees): In line with Proposition 2, one may be prompted to ponder 
whether the online estimator offers the performance guarantees of the nuclear-norm regularized estimator 
(PI), for which stable/exact recovery have been well documented e.g., in [8], [25], [42]. Specifically, given 
the learned traffic subspace P and the corresponding Q and A [obtained via (10)] over a time window 
of size T, is {X := PQ', A := A} an optimal solution of (PI) when T — > oo? This in turn requires 
asymptotic analysis of the optimality conditions for (PI), and is left for future research. Nevertheless, 
empirically the online estimator attains the performance of (PI), as evidenced by the numerical tests in 
Section VI. 

C. Proof of Proposition 3 

The main steps of the proof are inspired by [23], which studies convergence of an online dictionary 
learning algorithm using the theory of martingale sequences; see e.g., [22]. However, relative to [23] the 
problem here introduces several distinct elements including: i) missing data with a time-varying pattern O t ; 
ii) a non-convex bilinear term where the tall subspace matrix P plays a role similar to the fat dictionary 
in [23], but the multiplicative projection coefficients here are not sparse; and iii) the additional bilinear 
terms which entail sparse coding of at as in [23], but with a known regression (routing) matrix. Hence, 
convergence analysis becomes more challenging and demands, in part, for a new treatment. Accordingly, 
in the sequel emphasis will be placed on the novel aspects specific to the problem at hand. 

The basic structure of the proof consists of three preliminary lemmata, which are subsequently used 
to establish that Hindoo VCi(P[i]) = O^xp a.s. through a simple argument. The first lemma deals with 
regularity properties of functions C t and Ct, which will come handy later on; see Appendix A for a proof. 
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Lemma 1: If a2) and a5) hold, then the functions: i) {a^(P), qj (P)} = argmiii{ q a } gt(P, q, a), ii) 
gt(P,q[t],&[i\), Hi) ^t(P), and iv) V-^(P) are Lipschitz continuous for P £ C (X is a compact set), with 
constants independent of t. 

The next lemma (proved in Appendix B) asserts that the distance between two subsequent traffic subspace 
estimates vanishes as t — > oo, a property that will be instrumental later on when establishing that C%(P[i]) — 
C t (P[t}) -> a.s. 

Lemma 2: If o2)-a5) hold, then \\P[t + 1] - P[t]\\ F = 0(1 ft). 

The previous lemma by no means implies that the subspace iterates converge, which is a much more 
ambitious objective that may not even hold under the current assumptions. The final lemma however, 
asserts that the cost sequence indeed converges with probability one; see Appendix C for a proof. 

Lemma 3: If al)-a5) hold, then C t (P[t]) converges a.s. Moreover, Ct(P[t]) — Cf(P[t]) — > a.s. 

Putting the pieces together, in the sequel it is shown that the sequence {VC%(P[i]) — VC^Pft])}^ 
converges a.s. to zero, and since VC%(P[i]) = 0l xp by algorithmic construction, the subspace iterates 
{P[t]}^i coincide with the stationary points of the target cost function Cj. To this end, it suffices to 
prove that every convergent subsequence nulls the gradient VCt asymptotically, which in turn implies that 
the entire sequence converges to the set of stationary points of the batch problem (P3). 

Since C is compact by virtue of a3), one can always pick a convergent subsequence {P[t]}^ 1 whose 
limit point is P*, say 2 . Consider the positive-valued decreasing sequence {o^}^ (that necessarily con- 
verges to zero), and recall that (P[t] +a^U) > (P[t] +atU) for any UGl^. From the mean-value 
theorem and for arbitrary U, expanding both sides of the inequality around the point P[t] one arrives at 

C t (P[t]) + a t tr{UVQ(P[t])} + ^Mu'V^^ijU} > 

C t (P[t}) + a t tr{U'VCt(P[t])} + ^a?tr{U / V 2 C7 4 (0 2 )U} 

for some ©i, @2 £ M. Lxp and all t. Taking limit as t — > oo and applying Lemma 3 it follows that 

lim tr{U'(VC t (P[t]) - VC t (P[t]))} + lim laMV^C^) - V 2 C t (0 2 ))U} > 0. (18) 

One can readily show that V 2 t%(©i) = \ £t=i(q[ r h'M) ^ ft T + ^1 L P is bounded since P[i] is 
uniformly bounded [cf. a2)]. Consequently, lim^oo ^aitr{U'(V 2 C7((©i))U} = 0. Furthermore, since V£ T 
is Lipschitz as per Lemma 1, VC t is Lipschitz as well and it follows that lim^oo iaftr{U'V 2 Ct(@2)U} = 

2 Formally, the subsequence should be denoted as {Plt^)]}^, but a slight abuse of notation is allowed for simplicity. 
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0. All in all, the second term in (18) vanishes and one is left with 

lim tr{U'(VQ(P t ) - VC t (P t ))} > 0. (19) 

t— >oo 

Because U e R Lx p is arbitrary, (19) can only hold if lim t _ ) . 00 (VC' t (P[t]) - VC t (P[t})) = Lxp a.s., 
which completes the proof. ■ 

V. Further Algorithmic Issues 

For completeness, this section outlines a couple of additional algorithmic aspects relevant to anomaly 
detection in large-scale networks. Firstly, a lightweight first-order algorithm is developed as an alternative 
to Algorithm 2, which relies on fast Nesterov-type gradient updates for the traffic subspace. Secondly, the 
possibility of developing distributed algorithms for dynamic anomalography is discussed. 



A. Fast stochastic-gradient algorithm 

Reduction of the computational complexity in updating the traffic subspace P is the subject of this 
section. The basic alternating minimization framework in Section IV-A will be retained, and the updates 
for {q[i],a[i]} will be identical to those tabulated under Algorithm 2. However, instead of solving an 
unconstrained quadratic program per iteration to obtain P[t] [cf. (13)], the refinements to the subspace 
estimate will be given by a (stochastic) gradient algorithm. 

As discussed in Section IV-B, in Algorithm 2 the subspace estimate P[t] is obtained by minimizing the 
empirical cost function C7 f (P) = (1/t) J2t=i /t(P)> where 

f t (P) := ±\\n t (y t - Pq[t] - RtafeDIl! + ^||P||| + y ||q[*]||! + Xi\\a[t]\\i, t = l,2,... (20) 

By the law of large numbers, if data {Vn t (yt)}t^i w& stationary, solving minp limt_ > . 00 C*(P) yields 
the desired minimizer of the expected cost E[C t (P)], where the expectation is taken with respect to the 
unknown probability distribution of the data. A standard approach to achieve this same goal - typically 
with reduced computational complexity - is to drop the expectation (or the sample averaging operator for 
that matter), and update the nominal traffic subspace via a stochastic gradient iteration [34] 

P[t] = argimnQ (1/AW)>t (P,P[t - 1]) 

= P[i - 1] - £[t]V/t(P[* - 1]) (21) 

where fi[t] is a stepsize, Q M)t (Pi,P 2 ) := /t(P 2 ) + (Pi - P 2 , V/ t (P 2 )> + g||Pi - P 2 ||J, and V/*(P) = 
— dt(yt— Pq[t] — Rta[t])q'[t] + (A*/t)P. In the context of adaptive filtering, stochastic gradient algorithms 
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Algorithm 3 : Online stochastic gradient algorithm for unveiling network anomalies 
input {y^R^fit}^, p, A*,Ai,r?> 1. 

initialize P[0] at random, /Lt[0] > 0, P[l] := P[0], and fc[l] := 1. 
for t = 1,2,... do 

D[t] = (A*I, + P'[< - l]OtP[* - I]) -1 P'[* - 1] 

F'[t] := [ft t - totP[t - l]D[t}n u y/Kft t B'[t]] 

a[t] = argmin a [|||F[i](y t - R t a)|| 2 + Ai||a||i] 

q[t] =D[i]n t (y t -R t a t ) 

Find the smallest nonnegative integer i[t] such that with fl := rj l W(j,[t — 1] 

f t (P[t] - (l//i)V/ t (P>])) < Q A , t (P[t] - (l/A)V/tCPM),P[t]) 

holds, and set ^[i] = i]' l ^fi[t - 1]. 
P[t] = P[t]-( l/Mt])V /«(P[t]). 

fc[ i + i ] = ib^i. 

p[t + i] = P[t] + (j$bJ) (p[t]-p[t-i]). 

end for 

return x[t] := P[t]q[<], a[<] := a[<]. 



such as (20) are known to converge typically slower than RLS. This is expected since RLS can be shown 
to be an instance of Newton's (second-order) optimization method [34]. 

Building on the increasingly popular accelerated gradient methods for (batch) smooth optimization [5], 
[29], the idea here is to speed-up the learning rate of the estimated traffic subspace (21), without paying 
a penalty in terms of computational complexity per iteration. The critical difference between standard 
gradient algorithms and the so-termed Nesterov's variant, is that the accelerated updates take the form 
P[t] = P[t]— ju[i]V/t(P[£]), which relies on a judicious linear combination P[£— 1] of the previous pair of 
iterates {P[t-l],P[t-2]}. Specifically, the choice P [t] = P[f-l] + fc[t ~^j~ 1 (P[t- 1] - P[< - 2]), where 



k[t] = 1 + \J Ak 2 [t — 1] + 1 /2, has been shown to significantly accelerate batch gradient algorithms 
resulting in convergence rate no worse than 0(l/k 2 ); see e.g., [5] and references therein. Using this 
acceleration technique in conjunction with a backtracking stepsize rule [6], a fast online stochastic gradient 
algorithm for unveiling network anomalies is tabulated under Algorithm 3. Different from Algorithm 2, 
no matrix inversions are involved in the update of the traffic subspace P[i]. Clearly, a standard (non 
accelerated) stochastic gradient descent algorithm with backtracking stepsize rule is subsumed as a special 
case, when k[t] = 1, t = 0, 1, 2, . . . 
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Convergence analysis of Algorithm 3 is beyond the scope of this paper, and will only be corroborated 
using computer simulations in Section VI. It is worth pointing out that since a non-diminishing stepsize 
is adopted, asymptotically the iterates generated by Algorithm 3 will hover inside a ball centered at the 
minimizer of the expected cost, with radius proportional to the noise variance. 

B. In-network anomaly trackers 

Implementing Algorithms 1-3 presumes that network nodes continuously communicate their local link 
traffic measurements to a central monitoring station, which uses their aggregation in {Pn t (yt)}^i to 
unveil network anomalies. While for the most part this is the prevailing operational paradigm adopted 
in current network technologies, it is fair to say there are limitations associated with this architecture. 
For instance, collecting all this information centrally may lead to excessive protocol overhead, especially 
when the rate of data acquisition is high at the routers. Moreover, minimizing the exchanges of raw 
measurements may be desirable to reduce unavoidable communication errors that translate to missing 
data. Performing the optimization in a centralized fashion raises robustness concerns as well, since the 
central monitoring station represents an isolated point of failure. 

These reasons motivate devising fully-distributed iterative algorithms for dynamic anomalography in 
large-scale networks, embedding the network anomaly detection functionality to the routers. In a nutshell, 
per iteration nodes carry out simple computational tasks locally, relying on their own link count measure- 
ments (a few entries of the network-wide vector y t corresponding to the router links). Subsequently, local 
estimates are refined after exchanging messages only with directly connected neighbors, which facilitates 
percolation of local information to the whole network. The end goal is for network nodes to consent on 
a global map of network anomalies, and attain (or at least come close to) the estimation performance of 
the centralized counterpart which has all data {Vn t (yt)}^2.i available. 

Relying on the alternating-directions method of multipliers (AD-MoM) as the basic tool to carry 
out distributed optimization, a general framework for in-network sparsity-regularized rank minimization 
was put forth in a companion paper [24]. In the context of network anomaly detection, results therein 
are encouraging yet there is ample room for improvement and immediate venues for future research 
open up. For instance, the distributed algorithms of [24] can only tackle the batch formulation (P3), so 
extensions to a dynamic network setting, e.g., building on the ideas here to devise distributed anomaly 
trackers seems natural. To obtain desirable tradeoffs in terms of computational complexity and speed 
of convergence, developing and studying algorithms for distributed optimization based on Nesterov's 
acceleration techniques emerges as an exciting and rather pristine research direction; see [19] for early 
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Fig. 1. Synthetic network topology graph, and the paths used for routing three flows. 



work dealing with separable batch optimization. 

VI. Performance Tests 

Performance of the proposed batch and online estimators is assessed in this section via computer 
simulations using both synthetic and real network data. 

A. Synthetic network data tests 

Synthetic network example. A network of N = 15 nodes is considered as a realization of the random 
geometric graph model with agents randomly placed on the unit square, and two agents link if their 
Euclidean distance is less than a prescribed communication range of d c = 0.35; see Fig. 1. The network 
graph is bidirectional and comprises L = 52 links, and F = N(N—1) = 210 OD flows. For each candidate 
OD pair, minimum hop count routing is considered to form the routing matrix R. Entries of Vf are i.i.d., 
zero-mean, Gaussian with variance a 2 ; i.e., vij ~ N(0,a 2 ). Flow-traffic vectors z t are generated from 
the low-dimensional subspace U £ R Fxr with i.i.d. entries uj $ ~ AA(0, 1/F), and projection coefficients 
Wi t t ~ N(0, 1) such that z t = Uwj. Every entry of a 4 is randomly drawn from the set {—1,0, 1}, with 
Pr(a/ i j = — 1) = Pr(a/ j t = 1) = p/2. Entries of Y are sampled uniformly at random with probability it 
to form the diagonal sampling matrix Q t . The observations at time instant t axe generated according to 
'Pfitiyt) = ^t(Rz t + Ra t + vt). Unless otherwise stated, r = 2, p = 5, and /3 = 0.99 are used throughout. 
Different values of a, p and tt are tested. 
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Performance of the batch estimator. To demonstrate the merits of the batch BCD algorithm for unveiling 
network anomalies (Algorithm 1), simulated data are generated for a time interval of size T = 100. For 
validation purposes, the benchmark estimator (PI) is iteratively solved by alternating minimization over 
A (which corresponds to Lasso) and X. The minimizations with respect to X can be carried out using the 
iterative singular-value thresholding (SVT) algorithm [7]. Note that with full data, SVT requires only a 
single SVD computation. In the presence of missing data however, the SVT algorithm may require several 
SVD computations until convergence, rendering the said algorithm prohibitively complex for large-scale 
problems. In contrast, Algorithm 1 only requires simple p x p inversions. Fig. 2 (a) depicts the convergence 
of the respective algorithms used to solve (PI) and (P3), for different amounts of missing data (controlled 
by 7r). It is apparent that both estimators attain identical performance after a few tens of iterations, as 
asserted by Proposition 1 . To corroborate the effectiveness of Algorithm 1 in unveiling network anomalies 
across flows and time, Fig. 2 (b) maps out the magnitude of the true and estimated anomalies when 
o = 10~ 2 and n = 1. To discard spurious estimates, consider the hypothesis test af t t ^^J 1 0.1, with 
anomalous and anomaly-free hypotheses T-L\ and T~Lq, respectively. The false alarm and detection rates 
achieved are then PpA = 0.0011 and Pd = 0.947, respectively. 

Performance of the online algorithms. To confirm the convergence and effectiveness of the online 
Algorithms 2 and 3, simulation tests are carried out for infinite memory /3 = 1 and invariant routing 
matrix R. Figure 3 (a) depicts the evolutions of the average cost Cj(Pt) in (15) for different amounts of 
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Fig. 3. Performance of the online estimator for a — 1CP 2 , p = 0.005, Ai = 0.11, and A* = 0.36. (a) Evolution of the average 
cost Ct(P[£]) of the online algorithms versus the batch counterpart (P3). (b) Amplitude of true (solid) and estimated (circle 
markers) anomalies via the online Algorithm 2, for three representative flows when n — 1 (no missing data). 



missing data ir = 0.75, 1 when the noise level is a = 10~ 2 . It is evident that for both online algorithms 
the average cost converges (possibly within a ball) to its batch counterpart in (P3) normalized by the 
window size T = t. Impressively, this observation together with the one in Fig. 2 (a) corroborate that 
the online estimators can attain the performance of the benchmark estimator, whose stable/exact recovery 
performance is well documented e.g., in [9], [25], [42]. It is further observed that the more data are 
missing, the more time it takes to learn the low-rank nominal traffic subspace, which in turn slows down 
convergence. 

To examine the tracking capability of the online estimators, Fig. 3 (b) depicts the estimated versus 
true anomalies over time as Algorithm 2 evolves for three representative flows indicated on Fig. 1, 
namely /2, /6, /g corresponding to the / = 2,6,9-th rows of Aq. Setting the detection threshold to 
the value 0.1 as before, for the flows fi,f&,h Algorithm 2 attains detection rate Jfo = 0.83,1,1 at 
false alarm rate Pfa = 0.0171,0.0040,0.0081, respectively. The quantification error per flow is also 
around Pq = 0.7606, 0.5863, 0.4028, respectively. As expected, more false alarms are declared at early 
iterations as the low-rank subspace has not been learnt accurately. Upon learning the subspace performance 
improves and almost all anomalies are identified. Careful inspection of Fig. 3 (b) reveals that the anomalies 
for fg are better identified visually than those for /2. As shown in Fig. 1, /2 is carried over links 
(1, 2), (2, 4), (4, 14), (14, 3) each one carrying 33,31,35,22 additional flows, respectively, whereas fg is 
aggregated over link (1, 3) with only 2 additional flows. Hence, identifying /2's anomalies from the highly- 
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superimposed load of links (1, 2), (2, 4), (4, 14), (14, 3) is a more challenging task relative to link (1, 3). 
This simple example manifests the fact that the detection performance strongly depends on the network 
topology and the routing policy implemented, which determine the routing matrix. In accordance with [25], 
the coherence of sparse column subsets of the routing matrix plays an important role in identifying the 
anomalies. In essence, the more incoherent the column subsets of R are, the better recovery performance 
one can attain. An intriguing question left here to address in future research pertains to desirable network 
topologies giving rise to incoherent routing matrices. 

Tracking routing changes. The measurement model in (8) has two time-varying attributes which challenge 
the identification of anomalies. The first one is missing measurement data arising from e.g., packet losses 
during the data collection process, and the second one pertains to routing changes due to e.g., network 
congestion or link failures. It is thus important to test whether the proposed online algorithm succeeds in 
tracking these changes. As discussed earlier, missing data are sampled uniformly at random. To assess the 
impact of routing changes on the recovery performance, a simple probabilistic model is adopted where 
each time instant a single link fails, or, returns to the operational state. Let <1> denote the adjacency matrix 
of the network graph G, where = 1 if there exists a physical link joining nodes i and j, and zero 

otherwise. Similarly, the active links involved in routing the data at time t are represented by the effective 
adjacency matrix <&£ ff . At time instant t + 1, a biased coin is tossed with small success probability a, and 
one of the links, say 6 ^? ff , is chosen uniformly at random and removed from G while ensuring 
that the network remains connected. Likewise, an edge (I, k) G <&\<l>£ ff is added with the same probability 
a. The resulting adjacency matrix is then = 3>£ ff + 1^ t }^e^'k ~ 1{&i t } e « e j> wnere the indicator 

function tr xe x} equals one when x € X, and zero otherwise; and &i,f,&2,t ~ Ber(a) are i.i.d. Bernoulli 
random variables. The minimum hop-count algorithm is then applied to ^fS^ to update the routing matrix 
Rt+i. Note that R m = K t with probability (1 - a) 2 . 

The performance is tested here for fast and slowly varying routing corresponding to a = 0.1 and 
a = 0.01, respectively, when j3 = 0.9. A metric of interest is the average square error in estimating 
the anomalies, namely e" := \ Y^i=i W&t ~ a *ll2> an( ^ the un ^ traffic, namely ef := \ Y^\=\ II** — x tlli- 
Fig. 4 (a) plots the average estimation error for various noise variances and amounts of missing data. 
The estimation error decreases quickly and after learning the subspace it becomes almost invariant. To 
evaluate the support recovery performance of the online estimator, define the average detection and false 
alarm rate 



r=l 2^/=l X-{a f , T >Q.l,af, T >0.1} 



YX=1 Z)/=l 1 {a LT >0.1} 



Pfa ■ 




(22) 
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Fig. 4. Tracking routing changes for p = 0.005. (a) Evolution of average anomaly (dotted) and traffic (solid) estimation errors, 
(b) Evolution of average detection (solid) and false alarm (dotted) rates, (c) Estimated (red) versus true (blue) link traffic for three 
representative links, (d) Estimated (circle markers) versus true (solid) anomalies for three representative flows when n — 0.8, 
a = 10~ 5 , and a = 0.01. 



Inspecting Fig. 4 (b) one observes that for a = 0.01 and ir = 0.8, increasing the noise variance from 10 
to 10~ 2 lowers the detection probability by 10%. Moreover, when a = 10~ 5 and a = 0.01, dropping 
20% of the observations renders the estimator misdetect 11% more anomalies. The routing changes from 
a = 0.01 to a = 0.1 when a = 10~ 5 and tt = 0.8 comes with an adverse effect of about 6% detection-rate 
decrease. For a few representative network links and flows Fig. 4 (c) and (d) illustrate how Algorithm 2 
tracks the anomalies and link-level traffic. Note that in Fig. 4 (c) link 12 is dropped for the time period 
t G [220,420], and thus the traffic level becomes zero. The flows being carried over link 31 are also 
varying due to routing changes, which occur at time instants t = 220, 940 when the traffic is not tracked 
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accurately. 

B. Real network data tests 

Internet-2 network example. Real data including OD flow traffic levels are collected from the operation 
of the Internet-2 network (Internet backbone network across USA) [1], shown in Fig. 5. Flow traffic levels 
are recorded every 5-minute intervals, for a three-week operational period of Internet-2 during Dec. 8-28, 
2008 [1]. Internet-2 comprises N = 11 nodes, L = 41 links, and F = 121 flows. Given the OD flow traffic 
measurements, the link loads in Y are obtained through multiplication with the Internet-2 routing matrix, 
which in this case remains invariant during the three weeks of data acquisition [1]. Even though Y is 
"constructed" here from flow measurements, link loads can be typically acquired from SNMP traces [35]. 

The available OD flows are incomplete due to problems in the data collection process. In addition, flows 
can be modeled as the superposition of "clean" plus anomalous traffic, i.e., the sum of some unknown 
"ground-truth" low-rank and sparse matrices "P^(Xo + Ao). Therefore, setting R = Ip in (PI) one can 
first run the batch Algorithm 1 to estimate the "ground-truth" components {Xo, Ao}. The estimated Xo 
exhibits three dominant singular values, confirming the low-rank property of the nominal traffic matrix. 
To be on the conservative side, only important spikes with magnitude greater than the threshold level 
50 1| Y || f/LT are retained as benchmark anomalies (nonzero entries in Ao). 

Comparison with PCA-based batch estimators [20], [40]. To highlight the merits of the batch estimator 
(P3), its performance is compared with the spatial PCA-based schemes reported in [20] and [40]. These 
methods capitalize on the fact that the anomaly-free traffic matrix has low-rank, while the presence of 
anomalies considerably increases the rank of Y. Both algorithms rely on a two-step estimation procedure: 
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Fig. 6. Performance of the batch estimator for Internet-2 network data, (a) ROC curves of the proposed versus the PCA-based 
methods, (b) Amplitude of the true (blue) and estimated (red) anomalies for Pfa = 0.04 and Po = 0.93. 

(sl) perform PCA on the data Y to extract the (low-rank) anomaly-free link traffic matrix X; and (s2) 
declare anomalies based on the residual traffic Y := Y — X. The algorithms in [40] and [20] differ in 
the way (s2) is performed. On its operational phase, the algorithm in [20] declares the presence of an 
anomaly at time t, when the projection of y t onto the anomalous subspace exceeds a prescribed threshold. 
It is clear that the aforementioned method is unable to identify anomalous flows. On the other hand, the 
network anomography approach of [40] capitalizes on the sparsity of anomalies, and recovers the anomaly 
matrix by minimizing ||A||i, subject to the linear constraints Y = RA. 

The aforementioned methods require a priori knowledge on the rank of the anomaly-free traffic matrix, 
and assume there is no missing data. To carry out performance comparisons, the detection rate will be 
adopted as figure of merit, which measures the algorithm's success in identifying anomalies across both 
flows and time instants. ROC curves are depicted in Fig. 6 (a), for different values of the rank required 
to run the PCA-based methods. It is apparent that the estimator (P3) obtained via Algorithm 1 markedly 
outperforms both PCA-based methods in terms of detection performance. This is somehow expected, 
since (P3) advocates joint estimation of the anomalies and the nominal traffic matrix. For an instance of 
Ppa = 0.04 and = 0.93, Fig. 6 (b) illustrates the effectiveness of the proposed algorithm in terms of 
unveiling the anomalous flows and time instants. 

Online operation. Algorithm 2 is tested here with the Internet-2 network data under two scenarios: with 
and without missing data. For the incomplete data case, a randomly chosen subset of link counts with 
cardinality 0.15 x LT is discarded. The penalty parameters are tuned as Ai = 0.7 and A* = 1.4. The 
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Fig. 7. Performance of the online estimator for Internet-2 network data, (a) Evolution of average anomaly (dotted) and traffic 
(solid) estimation errors, (b) Evolution of average detection (solid) and false alarm (dotted) rates, (c) Estimated (red) versus true 
(blue) link traffic for thee representative links, (d) Estimated (circle markers) versus true (solid) anomalies for three representative 
flows when n — 0.85. 



evolution of the average anomaly and traffic estimation errors, and average detection and false alarm rates 
are depicted in Fig. 7 (a), (b), respectively. Note how in the case of full-data, after about a week the traffic 
subspace is accurately learned and the detection (false alarm) rates approach the values 0.72 (0.011). It is 
further observed that even with 15% missing data, the detection performance degrades gracefully. Finally, 
Fig. 7(c)[(d)] depicts how three representative link traffic levels [OD flow anomalies] are accurately tracked 
over time. 
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VII. Concluding remarks 

An online algorithm is developed in this paper to perform a critical network monitoring task termed 
dynamic anomalography , meaning to unveil traffic volume anomalies in backbone networks adaptively. 
Given link-level traffic measurements (noisy superpositions of OD flows) acquired sequentially in time, 
the goal is to construct a map of anomalies in real time, that summarizes the network 'health state' 
along both the flow and time dimensions. Online algorithms enable tracking of anomalies in nonstationary 
environments, typically arising due to to e.g., routing changes and missing data. The resultant online 
schemes offer an attractive alternative to batch algorithms, since they scale gracefully as the number of 
flows in the network grows, or, the time window of data acquisition increases. Comprehensive numerical 
tests with both synthetic and real network data corroborate the effectiveness of the proposed algorithms 
and their tracking capabilities, and show that they outperform existing workhorse approaches for network 
anomaly detection. 

Appendix 

A. Proof of Lemma 1: With Pi,P2 € C consider the function 

^(a.Pi.Pa) := i||F t (P 1 )(y i - R,a)||| - i||F i (P 2 )(y t - R*a)||i (23) 

where F t (P) := [Sl t [I L - PD t (P)] ft t , ^,D;(P)]' and D t (P) := (AJ P + PUP)^P' It can 
be readily inferred from a5) that 

u t (a t (P 2 ), Pi, P 2 ) - u t (a t (Pi), Pi, P 2 ) > c ||a t (P 2 ) - a*(Pi)||l (24) 

for some positive constant cq. The rest of the proof deals with Lipschitz continuity of u t (.,Pi,P 2 ). For 
ai and a 2 from a compact set A, consider 

2Ma 1 ,P 1 ,P 2 )-u t (a 2 ,P 1 ,P 2 )| =2<P4 [Fj(P 2 )F t (P 2 ) - F" t (Pi)F t (Pi)] , (a 2 - ai )yJ) 

+ (HFtCPORtailll - HF^PijRtaalll) - (||F t (P 2 )R iai ||2 - ||F t (P 2 )R i a 2 ||l) . (25) 
Introducing the auxiliary variable A a := a 2 — ai, the last two summands in (25) can be bounded as 
||F t (Pi)R 4 ai||l - HFtCPORtaalli - ||F i (P 2 )R t a 1 ||2 + ||F t (P 2 )R t a 2 ||2 

= (HF^PORtAJl- \\F t (P 2 )R t A a \\l) +2(R; [^(P 2 )F t (P 2 ) - F^(P 1 )F t (P 1 )] ,a 2 A' a ) 
< c 1 ||Pt(P 2 ) - FtCPOHHAall! + c 2 ||F;(P 2 )F i (P 2 ) - FjCPOFtCPOIIHAalla 
< C3 ||F i (P 2 )-F i (P 1 )||||A a || 2 (26) 
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for some constants ci , C2 , C3 > 0, since ||F 4 (P)|| for P £ C, ||A a ||2, [ | £>-2 1 1 2 for ai, a2 G A, and ||Rt|| are all 
uniformly bounded. The first summand on the right-hand side of (25) is similarly bounded (details omitted 
here). Next, to establish that F t (P) is Lipschitz one can derive the following bound (A/, := P2 — Pi) 

||F t (p 2 ) - F t (Pi)\\ < \\n t [p 2 D t (p 2 ) - PiD t (Pi)] n t || + ^/K\\n t (D' t (p 2 ) - d^(Pi))|| 
< ||Pi||(||Pi|| + v^)||(A,i p + P^t^r 1 - (Aj p + p'^tPi)- 1 !! 

+ ||A L ||(||Pi|| + ||P 2 || + y/K)\\(KIp + P 2 ntP2) _1 ||. (27) 

Upon introducing G' t G t := A' L ft t Pi + A' L fl t A L + P' i n t A L and H 4 := A*I p + PfitP' by utilizing the 
matrix inversion lemma, the first term is bounded as follows 

||(A J p + PJAP2)- 1 - (Kl P + P'^tPi)" 1 !! = HH^CPiJGtCI + G' t Ut 1 (P 1 )G t )- 1 G' t Ilt 1 (P 1 )\\ 

< \\U^\P 1 )f\\GMl + G^\P 1 )G t )-^ 

< (J-^J ||G t || 2 < C4 ||A L ||. (28) 

Putting the pieces together F t (.) is found to be Lipschitz and subsequently (25) is bounded by a constant 
factor of || Ai| 1 1| A a ||2- Substituting ai = a t (Pi) and a2 = at(P 2 ) along with the bound in (24) yields the 
desired result ||a i (P 2 ) — at(Pi)|| 2 < C5||P 2 — Pi||. Furthermore, from the relationship q t = D t (P)fi t (y t — 
Rta t ), Lipschitz continuity of qt(P) readily follows. 

Moreover, gt(P, q[t], a[t]) is a quadratic function on a compact set, and thus clearly Lipschitz continu- 
ous. To prove Lipschitz continuity of £t(P), recall the definition {q t (P), a. t (P)} = arg min/q a ) gt(P, q, a) 
to obtain after some algebra 

£ t (P 2 ) -4(P0 = ^ t (P2q*(P 2 ) + Rta f (P 2 ))||| - ||^ f (Piq t (Pi) + Rta t (P 1 ))||2 
- (Vn t (yt), P 2 q t (P 2 ) + Rta t (P 2 ) - Piq*(Pi) - R*a t (Pi)) 
+ y (||qt(Pa)||l - l|q*(Pi)Hl) + Ai(||a t (P 2 )||i - || 1) - (29) 

The first term in the right-hand side of (29) is bounded as 
||P nt (P 2 q t (P 2 ) + R t a t (P 2 ))|| 2 - ||Pn t (Piqt(Pi) + Rta t (Pi))||i < 

(||Pn t (P 2qi (P 2 ) - Piq f (Pi))|| 2 + ||^ t (Rta t (P 2 ) - RtatCPiJJUa) 
x (||7>n t (P 2 q f (P 2 ) + Rta t (P 2 ))|| 2 + ||Pn t (Piq*(Pi) + R^(Pi))|| 2 ) 

< C6(||P 2 -Pi||||qt(P 2 )|| 2 + ||Pi||||q t (P 2 ) -q t (Pi)|| 2 + \\R t || ||a t (P 2 ) - a t (Pi)h) 

(30) 
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for some constant cq > 0. The second one is bounded as 

(7^ t (yt),P 2 qt(P 2 ) + IWP 2 ) - Piqt(Pi) - Rta*(Pi)> 

< ||7 ? n t (yt)|| 2 (rn t (P2qt(P2) -Piq*(Pi))|| 2 + ||7 ? n t (B 4 a t (P 2 ) - R^a^Pi)) || 2 ) 

< \\Vn t (yt)h{\\P2 ~ Pi||||qt(P 2 )|| 2 + ||Pi||||q t (P 2 ) - q*(Pi)|| 2 + ||R t ||||a t (P 2 ) - a^Px)^). 

(31) 

Finally, one can bound the third term in (29) as 

^(|MP 2 )||| - ||qt(Pi)||i) + A^HatOPa)!!! - \\*t(Pi)\\i) < 

y ||q t (P 2 ) - q*(Pi)|| 2 (||q t (P 2 )|| 2 + ||q*(Pi)|| 2 ) + \iVF\\at(P 2 ) ~ a t (Pi)|| 2 . 

(32) 

Since qt(P) and a t (P) are Lipschitz as proved earlier, and Pi,P 2 G C are uniformly bounded, the 
expressions in the right-hand side of (30)-(32) are upper bounded by a constant factor of ||P 2 — Pi||, and 
so is \£ t (P 2 ) — 4(Pi)| after applying the triangle inequality to (29). 

Regarding V-^(P), notice first that since (q t (P), a t (P)} is the unique minimizer of #t(P,q, a) [cf. 
a5)], Danskin's theorem [6, Prop. B.25(a)] implies that W t (P) = Vn t (yt - Pq<(P) - R*a t (P))qJ(P). 
In the sequel, the triangle inequality will be used to split the norm in the right-hand side of 

||W t (P 2 ) - W t (Pi)|| F =\\Vn t (yt) [q*(P 2 ) - q t (Pi)]' - [Pn t (P 2 q t (P 2 ))qUP 2 ) - Pn t (Px<lt(Pi)H(Pi)] 

- [Po t (R t a t (P 2 ))qJ(P 2 ) -Po t (R t a t (P 1 ))qJ(P 1 )] || F . (33) 

The first term inside the norm is bounded as 

WVuM [q*(P 2 ) " q*(Pi)]' \\f < ||^ t (yt)|| 2 ||qt(P 2 ) - qf(Pi)|| 2 . (34) 
After some algebraic manipulations, the second term is also bounded as 

||Po { (P 2 q t (P 2 ))qUP2)-Pn s (Piqt(PiM(Pi)||f <||P 2 -Pi||f||q t (P 2 )||l 

+ IMP2) - q t (Pi)|| 2 (||q t (P 2 )|| 2 + ||q t (Pi)|| 2 ) (35) 
and finally one can simply bound the third term as 
||Pn t (FWP2))qi(P2)-Po t (Rta t (P^ 

+ ||q t (P 2 )-q t (P 1 )|| 2 ||a t (P 1 )|| 2 ). (36) 

Since a t (P) and qt(P) are Lipschitz and uniformly bounded, from (34)-(36) one can easily deduce that 
Wt(.) is indeed Lipschitz continuous. ■ 
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B. Proof of Lemma 2 : Exploiting that VQ(P[t]) = VCt+i(P[i + 1]) = 0l xp by algorithmic construction 
and the strong convexity assumption on C t [cf. a4)], application of the mean- value theorem readily yields 

C t (P[t + l])>C t (P[t)) + ^\\P[t + l)-P[t]f F 

C t+1 (P[t)) > C t+1 (P[t + 1]) + |||P[i + 1] - P[t]\\%. 

Upon defining the function h t (P) := Cj(P) — C7 i+ i(P) one arrives at 

c\\P[t + 1] - P[i]||| < h t (P[t + 1]) - h t (P[t\). (37) 

To complete the proof, it suffices to show that h t is Lipschitz with constant 0(l/t), and upper bound the 
right-hand side of (37) accordingly. Since [cf. (16)] 

* 1 A 

- 9r(P, q[r], a[r]) - q[* + 1], a[* + 1]) + — 

T=l 



ht(P) 



(38) 



and ffi(P) is Lipschitz according to Lemma 1, it follows that ^ is Lipschitz with constant 0(l/t). ■ 
C. Proof of Lemma 3: The first step of the proof is to show that {Ct (Pft])}^ is a quasi-matringale 
sequence, and hence convergent a.s. [22]. Building on the variations of Cf(P[i]), one can write 

C t+1 (P[t + 1]) - C t (P[t}) = C t+1 (P[t + 1]) - <%+i(P[t]) + C t +i(P[t]) - C t (P[t]) 

< C t+1 (P[t])-C t (P[t)) 



t + l 



(b) 1 

< 



t + l 



9i+1 (P[t],q[t + l],a[t + l])-^ ffT (P[r],q[r],a[r]) 

1 r=l 

1 * 

g t +i(P[t], q[t + 1], a[t + 1]) - - 5]^(P[t]) 



T=l 



(39) 



where (a) uses that C t+1 (P[t + 1]) < C7 m (PM), and (b) follows from C t (P[i]) < C t (P[t}). 

Collect all past data in Ft = {(Q T , Yt) '■ t < t}, and recall that under al) the random processes {f2 f , y t } 
are i.i.d. over time. Then, the expected variations of the approximate cost function are bounded as 



E 



C t+1 (P[t + l])-C t (P[t])\T t 



< U[g t+1 (P[t],q.[t + l], a [t + l])\F t ]-^e T (P[ 

\ T=l 



(a) 1 



t + 



T (E[/i(P[t])]-±J>(P[t]) 

V T=l , 



< _L sup (E[4(PW)]-y^4(PM)) 
* + 1 P[t]ec V * ^i / 



(40) 



IEEE JOURNAL OF SELECTED TOPICS IN SIGNAL PROCESSING (SUBMITTED) 



30 



where (a) follows from al). Using the fact that ^(Pf) is Lipschitz from Lemma 1, and uniformly bounded 
due to a2), Donsker's Theorem [38, Ch. 19.2] yields 



E 



suplEftCPfl)]-^ J>(P[*])| 
p[t] t * 



T = l 



0(1/Vt). 



From (40) and (41) the expected non-negative variations can be readily bounded as 



E 



E 



C t+1 (P[t + l])-C t (P[t\)\T t 



0(l/t 3/2 ) 



and consequently 



t=i 



E 



C t+1 (P[t + 1]) - C t (P[t])\T t 



< oo 



(41) 



(42) 



(43) 



which indeed proves that {C't(P[t])}^. 1 is a quasi-martingale sequence. 

To prove the second part, define first U t (P[t]) := C t (P[t]) - |f ||P[*]||| and U t (P[t]) := C t (P[t]) - 
|f || P for which U t (P[t]) - Ut{P[t}) = C t (P[t]) - Ct(P[t]) holds. Following similar arguments as 
with Q(P[i]), one can show that (43) holds for f7 t (P[t]) as well. It is also useful to expand the variations 



U t+1 (P[t + 1]) - U t (P[t]) = U t+l (P[t + 1]) - U t +i(P[t\) + 
and bound their expectation conditioned on Ft, to arrive at 

u tm) -u tm)< Ete+i(P[t + 1]) _^ i(PM)| ^ 



i t+ i(P[t]) - U t (P[ 

t + 1 



+ 



U t (P[t]) - U t (P[ 

t + 1 



t + l 



E 



+ 



E 



U t+1 (P[t + l])-U t (P[t])\Tt 



+ 



1 



t + l 



E| 



(p[*])]-jE^(p[ 



T=l 



(44) 



Focusing on the right-hand side of (44), the second and third terms are both 0(l/t 3 / 2 ) since counterparts 
of (41) and (42) also hold for U t (P[t]). With regards to the first term, using the fact that C t+ i(P[t + l}) < 
C t+1 (P[t]), from Lemma 1 and a4), it follows that U t+ i{P[t + 1]) - U t+ i(P[t}) = o(l/i). All in all, 

£ ft ' P M» - U ' (P[t]) < oo ,s. ,45) 
Defining d t (P[t]) := U t (P[t]) — Ut(P[t]), due to Lipschitz continuity of £ t and gt (cf. Lemma 1), and 



uniform boundedness of {Pt}£^i [cf a3)], invoking Lemma 2 one can establish that dt+i(P[t + 1]) 
dt(P[t]) = 0(l/t). Hence, Dirichlet's theorem [33] applied to the sum (45) asserts that lim^oo dt(P[t]) 
a.s., and consequently lim i _ i . 00 (C't(P[t]) — Ct(P[t])) = a.s. 
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